1. INTRODUCTION

Reliable research demands data of known quality, so measuring the quality of data is a critical
step in any data analysis task. Data quality has been an active research topic for many years and continues to receive a great
deal of attention [1]-[3], as the value of data depends heavily on its quality and ease of use. Correct use of
high-quality data helps in planning, analysis, and decision-making, whereas poor-quality data
can be unfit for the intended purpose, with potentially serious consequences. Studies on data quality
have been carried out in various fields, ranging from data-driven domains such as big data and statistics to
domain-driven domains such as healthcare, finance, and information systems.

To advance research on data quality, it is important to have a comprehensive understanding
of the current state of knowledge. This involves identifying, evaluating, and analyzing relevant research
conducted in recent years. Hence, this paper provides an in-depth review of
data quality in order to answer the question of how data quality can be assessed effectively. Through a
systematic literature review (SLR), we rigorously reviewed and analyzed empirical studies published between 2016 and
2021. This process enabled the identification of existing research domains associated with data
quality. Furthermore, the study compiled the data quality models proposed by other researchers, providing
a comprehensive overview of the conceptual frameworks in use. Additionally, the review examined
the proposed methodologies and metrics for measuring data quality, shedding light on the tools and approaches
available for data quality assessment (DQA).

The findings obtained through the SLR reveal the critical nature of data quality. They highlight the
significance of ensuring high-quality data and emphasize the need for ongoing research in this area. The identified
gaps and emerging trends in the field of data quality suggest new directions for future studies. These findings
carry implications for both researchers and practitioners, offering insights for defining and assessing
data quality effectively. Overall, the SLR contributes to advancing knowledge in the field of data quality by
providing a comprehensive synthesis of existing research, identifying key areas for further exploration, and
offering guidance to researchers and practitioners interested in enhancing their understanding and management
of data quality.

The remainder of this paper is structured as follows. Section 2 presents the background
and preliminaries, including an overview of data quality, data quality dimensions, and DQA. Section 3 describes the
research method and defines the phases of the systematic review. Section
4 presents and analyzes our findings, while section 5 describes potential research areas and highlights the
review’s limitations. Lastly, section 6 summarizes our findings and outlines future work.


2. BACKGROUND 

Data quality and DQA form the central focus of this systematic review, and this section provides an
introduction to these key concepts. By examining the dimensions and factors that contribute
to data quality, the review aims to shed light on effective approaches for assessing and improving
data quality.


2.1. Data quality and data quality dimensions 

The ISO 9000 standard defines the term “quality” as the extent to which consumer needs are met,
considering all characteristics required by the customer for the product or service [4]. Generally, the concept of
data quality refers to the adequacy of data for the intended purpose [5]. In other words, it reflects a requirement
that the user expects to be fulfilled or a data value that the user expects to obtain. It is a
multidimensional construct, since several elements must be taken into account. These elements are described
by data quality dimensions, which are evaluated using predetermined metrics [6].

The same data can serve various uses, which can lead some users to judge
the quality of the data to be high while others judge it to be low. Data quality therefore has a dual,
partly subjective character: it is judged against both user expectations and specifications. In other words, data
quality is context dependent [7]. To support a decision, relevant information is sought from the data after
they have been analyzed and processed, and the level of quality is indicated by performing evaluative measures
using a specific DQA model [8].

According to Wang and Strong [7], data quality dimensions are a collection of data quality attributes
that each reflect a distinct feature or construct of data quality, i.e., each dimension concerns a specific aspect.
Measuring data quality therefore requires conducting DQAs to determine how well the data
meet user needs [9]. The literature on data quality encompasses a range of dimensions, with accuracy,
completeness, consistency, and timeliness emerging as the most frequently mentioned ones. These dimensions
capture crucial aspects of data quality and serve as foundational elements for assessing data reliability and
fitness for purpose. However, researchers have proposed various categorizations and
definitions of data quality dimensions, resulting in multiple categories being identified in the literature [10]-[13].
For instance, Weikum develops a visionary classification of data quality criteria, distinguishing between
system-centric, process-centric, and data-centric criteria, as shown in Table 1.


Table 1. Classification criteria of Weikum

System-centric notions         Process-centric notions     Data-centric notions
Reliability, availability,     Safety properties and       Accuracy, comprehensiveness,
integrity, security,           liveness properties         timeliness, credibility,
performance, and                                           cost-effectivity, and latency
verifiability


2.2. Data quality assessment

DQA is a crucial component of every data analysis operation. It is the process of determining whether 
the data meet a user’s information requirements in a particular use case [7]. It involves measuring the quality 
dimensions or criteria that are relevant and comparing the findings to the user’s quality needs [15]. One of 
the challenges in DQA arises from the contextual nature of data quality. It is essential to consider the specific 
context in which data is used, as the evaluation of data quality is highly dependent on the intended purpose and 
the particular scenario. The same quality dimension may hold different relevance and require distinct evaluation 
methods in different contexts [7]. This contextual variability adds complexity to the assessment process and 
necessitates a nuanced understanding of the specific circumstances surrounding data usage. 

Quality models play a significant role in DQA by providing detailed specifications of quality mea- 
sures. These models offer guidance on the selection and application of appropriate measures for assessing data 
quality. They outline the definitions, scales, and formulas associated with each measure, enabling researchers 
and practitioners to determine the most suitable approach for measuring specific aspects of data quality. By 
indicating which measures are relevant and how they should be measured, quality models contribute to a stan- 
dardized and systematic evaluation of data quality [16]. This ensures consistency and facilitates meaningful 
comparisons across different datasets and contexts. The utilization of quality models enhances the effectiveness 
and accuracy of DQA processes. Indeed, judging the quality of data frequently necessitates the computation of 
a large number of quality measures rather than a single measure [8]. 
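
The assessment loop described above (measure the relevant dimensions, then compare the results with the user's quality needs) can be illustrated with a minimal Python sketch. The records, the two measures (completeness and uniqueness), and the threshold values are assumptions chosen for illustration; they are not taken from any of the reviewed models.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Rabat"},
    {"id": 2, "age": None, "city": "Fes"},
    {"id": 3, "age": 51, "city": None},
    {"id": 3, "age": 51, "city": None},  # exact duplicate of the previous record
]

def completeness(rows: list[dict]) -> float:
    """Share of cells that are not missing."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is not None for value in cells) / len(cells)

def uniqueness(rows: list[dict]) -> float:
    """Share of distinct records among all records."""
    distinct = {tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows}
    return len(distinct) / len(rows)

def assess(rows, measures, requirements):
    """Compare each measured dimension with the user's minimum requirement."""
    report = {}
    for name, measure in measures.items():
        score = measure(rows)
        report[name] = (score, score >= requirements[name])
    return report

measures = {"completeness": completeness, "uniqueness": uniqueness}
# Assumed user requirements for one particular use case.
requirements = {"completeness": 0.70, "uniqueness": 0.90}

report = assess(records, measures, requirements)
for dimension, (score, ok) in report.items():
    print(f"{dimension}: {score:.2f} -> {'meets' if ok else 'fails'} requirement")
```

Changing only the `requirements` dictionary changes the verdict for the same data, which is one way to read the context dependence discussed in this section.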


3. RESEARCH METHOD 

The goal of this research is to perform a comprehensive literature review on DQA following the original
SLR guidelines given in [17]. According to these guidelines, our SLR consists of three essential activities:
planning, conducting, and reporting the review [17]. Each activity comprises several steps. Figure 1
depicts a summary of the study’s implementation.


[Figure 1: flowchart of the SLR process in three columns. Planning: definition of the objectives of the SLR, formulation of the research questions, identification of the search string, selection of sources, selection criteria. Conducting: selection of studies, quality assessment, data extraction, data synthesis. Reporting: analysis of results, report writing.]

Figure 1. SLR process 


3.1. Main objective 

The main objective of this SLR is to provide a comprehensive overview of recent research
in the field of data quality. The focus is on understanding and evaluating the existing approaches,
methodologies, and techniques used for DQA. By analyzing and assessing these approaches, we aim to enhance
the overall understanding of data quality and provide insights that can improve the accuracy and reliability of
research outcomes. In doing so, the SLR offers insights and recommendations
for researchers, practitioners, and decision-makers who define, evaluate, and improve
data quality. Ultimately, the goal is to contribute to the improvement of DQA practices and to enhance the
credibility and impact of research outcomes.


3.2. Research questions 


As research questions are used to determine and guide the review process, refining them is the most 
challenging task of any systematic review. The following are our research questions: 


— RQ1: what are the existing research domains related to data quality?
— RQ2: what data quality models are crucial for assessing data quality?
— RQ3: which methodologies were utilized to assess data quality?
— RQ4: what are the quality metrics used in DQA?


3.3. Search process

Besides the research questions that guide the SLR, some additional questions need to be considered in
order to build a sound search strategy [17]. It is necessary to specify which time period
should be covered, which search strings to use, which databases or sources to search, and
which criteria to use for selecting studies. We answer these questions in turn in the following subsections.


3.3.1. Timing 

In our study, we focused on recent work by setting a specific timeframe for our search process. We
limited our search to articles published between 2016 and 2021 to capture the most up-to-date
research in the field of data quality. This timeframe allows us to gain insights into current trends, advancements,
and emerging areas of research.


3.3.2. Search string 

The search string should be structured to reflect the words frequently used in the titles of relevant
articles found by the reference search results. Keywords for searching articles were created using several
criteria, including the identification of keywords based on the research questions, the use of synonyms, acronyms, and
alternative spellings, and the use of the Boolean “AND” and “OR” operators to connect keywords. This resulted in
the following final search string: (“data quality” AND (“quality assessment” OR “quality evaluation”)).
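
To make the Boolean structure of the search string concrete, the following minimal Python sketch evaluates it against a few invented titles. It only illustrates the logic: each digital library has its own query syntax and matching rules, and the helper name `matches_search_string` and the sample titles are hypothetical.

```python
def matches_search_string(text: str) -> bool:
    """Evaluate ("data quality" AND ("quality assessment" OR "quality evaluation"))
    as a case-insensitive phrase match on the given text."""
    t = text.lower()
    return "data quality" in t and (
        "quality assessment" in t or "quality evaluation" in t
    )

# Invented titles, used only to exercise the three possible outcomes.
titles = [
    "A framework for data quality and quality assessment in IoT",
    "Quality evaluation of sensor streams",
    "Data quality in big data: a quality evaluation approach",
]

selected = [title for title in titles if matches_search_string(title)]
print(selected)
```

The second title is rejected because it lacks the mandatory phrase “data quality”, even though it satisfies the OR clause.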


3.3.3. Sources and selection criteria 

Because manually searching for all relevant articles in conferences and journals is extremely time consuming,
an online search was performed in three high-quality digital libraries (Scopus, ACM, and DBLP), as they cover
the relevant publications and conference proceedings. After the preliminary search in the three digital
libraries, the resulting collection of articles was sorted based on titles, years of publication, keywords,
and abstracts. We then used the criteria shown in Table 2 to decide whether to retain each
preliminary result for further processing.


Table 2. Inclusion and exclusion criteria used

Inclusion criteria                                        Exclusion criteria
Studies published between 2016 and 2021                   Duplicate studies
Studies published in English                              Non-peer-reviewed studies
Full studies focusing on the data quality area            Magazines, tutorials, and editorials
Studies pertaining to our research questions (RQ1-RQ4)    Studies that do not consider data quality
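
The criteria in Table 2 can be read as a screening filter applied to each candidate record. The sketch below is one possible encoding, not the procedure actually used in the review: the `Study` fields, the title-based duplicate check, and the sample records are assumptions, and the two relevance flags stand for judgements that a reviewer makes manually.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Study:
    title: str
    year: int
    language: str
    peer_reviewed: bool
    kind: str               # e.g. "article", "magazine", "tutorial", "editorial"
    on_data_quality: bool   # manual judgement: the study focuses on data quality
    relevant_to_rqs: bool   # manual judgement: the study pertains to RQ1-RQ4

EXCLUDED_KINDS = {"magazine", "tutorial", "editorial"}

def is_included(study: Study) -> bool:
    """Apply the inclusion and exclusion criteria to a single study."""
    return (
        2016 <= study.year <= 2021
        and study.language == "English"
        and study.peer_reviewed
        and study.kind not in EXCLUDED_KINDS
        and study.on_data_quality
        and study.relevant_to_rqs
    )

def screen(studies: list[Study]) -> list[Study]:
    """Drop duplicates (same normalized title), then apply the criteria."""
    seen: set[str] = set()
    kept: list[Study] = []
    for study in studies:
        key = study.title.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        if is_included(study):
            kept.append(study)
    return kept

# Invented candidates covering a duplicate, an editorial, and an out-of-range year.
candidates = [
    Study("Assessing data quality in IoT", 2019, "English", True, "article", True, True),
    Study("assessing data quality in IoT ", 2019, "English", True, "article", True, True),
    Study("Data quality editorial", 2020, "English", True, "editorial", True, True),
    Study("Data quality before 2016", 2014, "English", True, "article", True, True),
]
kept = screen(candidates)
print([study.title for study in kept])
```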


3.4. Quality assessment 

After the full-text reading of the articles, the quality of each publication was assessed to
strengthen the reliability of our study. To this end, we propose a quality assessment model that can be
used to determine the relevance of each paper, and whether it contains the requested information, based on its
score. The model is a formula that uses seven assessment criteria (AC) whose values are
either 1 or 0 (i.e., yes or no) and assigns a coefficient θ to
each AC based on its level of relevance, in order to compute the final score of each paper. Our quality evaluation model is represented by:


$$S(m) = \sum_{i=1}^{7} \theta_i \, AC_i$$


We assign a high coefficient to papers proposing a novel quality assessment model (AC2) and suggesting evaluation
metrics (AC3), and a medium coefficient to papers presenting a framework or a quality evaluation methodology
(AC4) and including simulations or prototypes (AC5). We assign the lowest coefficient to articles that define
data quality (AC1), outline a state of the art (AC6), or conduct a poll (AC7). Table 3 summarizes our
quality assessment model.
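
As a worked illustration of the scoring model, the following Python sketch computes the weighted score of a paper from its yes/no answers, using the coefficients listed in Table 3 and the elimination rule reported in section 4.1 (papers scoring less than or equal to 2 are dropped). The function names and the two example papers are hypothetical.

```python
# Coefficients per assessment criterion, as listed in Table 3.
WEIGHTS = {
    "AC1": 1.0,  # data quality definition (low)
    "AC2": 2.0,  # quality evaluation model (high)
    "AC3": 2.0,  # assessment metrics (high)
    "AC4": 1.5,  # methodology or framework (medium)
    "AC5": 1.5,  # simulation or prototype (medium)
    "AC6": 1.0,  # state of the art (low)
    "AC7": 1.0,  # conducting a polling (low)
}

def quality_score(answers: dict[str, int]) -> float:
    """Weighted sum of the criteria; each answer is 1 (yes) or 0 (no).
    Criteria missing from `answers` are treated as 0."""
    for name, value in answers.items():
        if name not in WEIGHTS or value not in (0, 1):
            raise ValueError(f"invalid answer {name}={value}")
    return sum(WEIGHTS[name] * answers.get(name, 0) for name in WEIGHTS)

def is_retained(answers: dict[str, int], threshold: float = 2.0) -> bool:
    """Papers with a score less than or equal to the threshold are eliminated."""
    return quality_score(answers) > threshold

# Hypothetical paper proposing a model and metrics: 2 + 2 = 4, retained.
model_paper = {"AC2": 1, "AC3": 1}
# Hypothetical paper that only defines data quality and surveys prior work: 1 + 1 = 2, eliminated.
survey_paper = {"AC1": 1, "AC6": 1}

print(quality_score(model_paper), is_retained(model_paper))
print(quality_score(survey_paper), is_retained(survey_paper))
```

With these coefficients the maximum attainable score is 10, reached when all seven criteria are satisfied.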


3.5. Data extraction and synthesis 

Data extraction was conducted for each of the 100 selected publications, and the results were recorded
in an Excel file. The title, authors, year of publication, publication type, and domain of application were
the columns collected for all the sources in this study. Following the extraction step, the
retrieved data were analyzed to answer the aforementioned research questions. During the data synthesis phase, we
collected, processed, and summarized the results of relevant studies.
Data were presented in tables, graphs, plots, and charts. The results of this study are then shown as maps to help
better understand the data quality landscape.


Table 3. Quality AC

AC     Description                  Level     Coefficient θ
AC1    Data quality definition      Low       1
AC2    Quality evaluation model     High      2
AC3    Assessment metrics           High      2
AC4    Methodology or framework     Medium    1.5
AC5    Simulation or prototype      Medium    1.5
AC6    State of the art             Low       1
AC7    Conducting a polling         Low       1


4. RESULTS 
This section presents the findings of the SLR, conducted using the research method described
in section 3. We give an overview of the selected publications and analyze the results.


4.1. Overview of selected studies 

The automated searches of the electronic data sources returned a set of 987 publications.
The titles were then checked for duplicates, leaving 794 candidates
for the next round of processing, which applied the previously described inclusion and exclusion
criteria. After the exclusion criteria were applied, the number of papers to be read for the systematic
review dropped from 794 to 340, and only 138 of those 340 publications were deemed “relevant”.
The seven quality evaluation criteria listed in Table 3 were then used to ensure that the incorporated
findings contribute significantly to the SLR. As a result, 52 papers with a score of less than or equal to 2
were eliminated. Eventually, 100 papers were chosen to answer the four research questions. Figure 2 gives an
overview of the selection process.


[Figure 2: flow diagram of the selection stages: performing the digital library search (987 publications), filtering repeated papers, applying selection criteria based on titles and abstracts, reading the full text, and applying quality assessment (100 publications).]


Figure 2. Summary of the selection process


4.2. Classification of selected research 

We further assessed the outcomes in order to summarize the publishing trends in the area of
data quality. The results of the search process are displayed in Table 4, which gives the number of selected studies
by data source and year of publication. Figures 3(a) and 3(b) show the distribution of the
100 relevant publications by year (2016-2021). Our results show that the number of articles on data quality
starts to increase from 2018, with 74 articles published during the last 4 years, although the results for 2021
are not conclusive because the search was conducted at the start of 2021. By comparison, 11 and
15 papers were published in 2016 and 2017, respectively, which suggests that data quality research is gaining
prominence and that the number of published research articles has grown over the review period. Next, as can be seen
from Figure 3(c), which displays the distribution of papers by data source, the largest share of the selected
articles (40 publications, or 40%) was indexed in DBLP.


Table 4. Search process result

Source    2016    2017    2018    2019    2020    2021
ACM       2       4       12      4       5       1
DBLP      7       8       5       13      6       1
Scopus    2       3       7       7       11      2
Total     11      15      24      24      22      4


Figure 3. Reviewed paper statistics: (a) number of publications per source, (b) studies per year, and (c)
source distribution of selected papers
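
The figures quoted in this subsection can be cross-checked against Table 4 with a few lines of Python. This is only a consistency check on the reported counts; the Scopus counts for 2018 and 2019 are taken as 7 each, the values implied by the column totals, and the variable names are illustrative.

```python
YEARS = [2016, 2017, 2018, 2019, 2020, 2021]

# Selected studies per source and year, following Table 4.
COUNTS = {
    "ACM":    [2, 4, 12, 4, 5, 1],
    "DBLP":   [7, 8, 5, 13, 6, 1],
    "Scopus": [2, 3, 7, 7, 11, 2],  # 2018 and 2019 inferred from the column totals
}

# Row totals: number of selected studies per source.
per_source = {source: sum(values) for source, values in COUNTS.items()}

# Column totals: number of selected studies per year.
per_year = {
    year: sum(COUNTS[source][i] for source in COUNTS)
    for i, year in enumerate(YEARS)
}

total = sum(per_source.values())
share = {source: 100 * n / total for source, n in per_source.items()}
since_2018 = sum(n for year, n in per_year.items() if year >= 2018)

print(per_source)   # studies per source
print(per_year)     # studies per year
print(total, since_2018, share["DBLP"])
```

Under these counts the totals agree with the text: 100 selected studies overall, 74 of them published from 2018 onward, and 40 (40%) coming from DBLP.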


4.3. Results of data extraction 

Throughout the systematic review, we addressed four research questions to guide our investigation into
the field of DQA. These questions were designed to explore the existing literature, analyze the methodologies
employed, and identify key findings and trends. The findings we report in this section are directly related to
these research questions and provide insights into the current state of knowledge in the field.


4.3.1. RQ1: what are the existing research domains related to data quality?


The field of data quality has attracted researchers’ attention for many years. According to Bertossi
and Rizzolo [18], the assessment of data quality is context-dependent: data quality must be
considered in close relation to the intended use of the data and can only be assessed in that context. As a result,
the topics of discussion are very broad. Classifying research domains in the field of data quality is crucial
for directing researchers’ attention towards the least explored topics and identifying areas that require further
investigation. Through our systematic review, we have observed that numerous studies have focused on the
same topic but have taken different approaches and perspectives. This highlights the breadth and diversity of
data quality research across various domains. Some of the domains that have been extensively studied include:


— Big data: with the exponential growth of data, research on data quality within the context of big data has
become increasingly important. This domain explores the challenges and techniques for ensuring data
quality in large-scale datasets [19].

— Artificial intelligence (AI): with the growing role of AI in data analysis, ensuring data quality becomes
crucial for accurate decision-making. Research in this area focuses on the impact of data quality on AI
performance and explores strategies to maintain high-quality data for AI applications [20].

— Healthcare: DQA in healthcare settings has gained significant attention due to the critical role of accurate
and reliable data in patient care, clinical decision-making, and healthcare outcomes [21].

— Internet of things (IoT): the IoT is a study area that offers broad prospects and is of interest to researchers
worldwide. It is a technological shift that is expected to alter the way people work, think, and live, and it relies
strongly on the accuracy of the data generated by IoT devices [22].

— Web: the vast amount of data now available on the web has increased user acceptance of web
technologies in recent years, and the web has become the main source of information. Nevertheless, even with a large
amount of data, its quality remains doubtful [23]. As a result, investigations on data quality have been
carried out to evaluate and improve the quality level of the provided data.

— Industry: in industry, there are many cases where data quality degrades throughout the life of the
production system. Beyond aging sensors, the entire data collection infrastructure can introduce
disruptions. These infrastructures can perform poorly, specifically in terms of network
parameters such as losses, delays, or traffic load, resulting in a decrease in data quality [20].

— Information systems: assessing data quality within a complex information system can be challenging
because multiple data sources, both hardware and software, are involved. These challenges include
analyzing and processing a large collection of data in order to ensure data quality [24].

— Social media: social media has emerged as a new source of useful information. If processed correctly,
social media data can provide insight for business decision-making; assessing the quality of social data is,
however, a context-dependent task [25].

— General context: besides these 8 research domains, many other studies address
data quality in a general context without being domain specific. These studies are
further detailed under the following research questions.


Table 5 provides an overview of the classification of the selected studies by
research domain. This classification gives a better understanding of the distribution and focus of
research efforts across the domains related to data quality.


Table 5. Data quality research domains

DQ domains             Studies        Number
Big data                              15
IoT                                   14
Information systems                   14
Web                                   11
Social media                          5
Healthcare                            11
Industry                              5
AI                                    5
General context        [102]-[122]    21


4.3.2. RQ2: what data quality models are crucial for assessing data quality?

After selecting the research domains related to the quality of data, we now focus on the review of 
proposed data quality models to collect dimensions used by researchers and classify each model used with its 
specific domain. 


— Big data: to better understand the process of the national standard reference data (SRD) program,
which uses metrological concepts and methodologies to create trustworthy
data and transform such data into national standards, Lee presented the notion of data traceability
with three dimensions, known as the DQA matrix, which is based on the elements of a data production
system and related evaluation criteria. According to Zhang et al. [32], recommendation and prediction
systems should be used as examples for analyzing the quality of big data and identifying its problems at
each level of processing. Once these issues have been analyzed, the corresponding solution for
each problem is given from the perspective of data quality. Taleb et al. created a big data quality
profiling model (BDQPM) with four dimensions: accuracy, completeness, consistency, and timeliness.
They emphasized the ideas of data profiling and data quality profiling; the model contains multiple modules
that inspect the quality of data by offering a set of actions to be implemented, particularly in the pre-processing
phase, in relation to the evaluation of data quality. Similarly, Cappiello et al., in their
data quality service (DQS) module, used a model of seven dimensions: accuracy, completeness, consistency,
distinctness, precision, timeliness, and volume. Table 6 presents all the quality dimensions used
by the authors.

— Healthcare: for medical data, Terry et al. [83] proposed a standard model for assessing data quality in primary
health care electronic medical records (EMRs), which comprises three processes: conceptualizing,
developing, and testing. Eleven metrics for assessing the quality of EMR data used in primary
healthcare were established and tested across three EMR datasets in the domains of comparability,
completeness, accuracy, and currency. Likewise, Zan and Zhang provided a data quality evaluation
model with five dimensions (accuracy, consistency, integrity, timeliness, and normative) to identify data
affecting credibility. A specific process analyzes the trust of each data source and eliminates
those that are untrustworthy. Finally, valid data are evaluated by calculating all dimensions of data quality.
Similarly, a quality model has been proposed for assessing electronic health record (EHR) data quality within medical
informatics in research and care in university medicine (MIRACUM), using standard EHR data quality
metrics including plausibility, completeness, and conformance [89]. MIRACUM is a partnership of ten
German university hospitals and business partners that aims to address the issues associated with digitization
and future medical research. It is a member of the German medical informatics initiative (MII)
[123]. In addition, Lee et al. implemented an existing DQA framework, using definitions of
several data quality dimensions to propose a harmonized framework that focuses on the categories of
conformance, completeness, and plausibility. Then, using the DQA framework shown in Table 6, they
developed and analyzed an inventory of common phenotypic data elements (CPDEs) obtained from the study datasets.
Stoldt and Weber proposed a quality model that aims to assess the quality of medical
data and support clinical decision-making. The authors extend the fast healthcare interoperability
resources (FHIR) model to enable data provenance annotations to be stored in EHRs. They used
fuzzy logic to determine the level of reliability of the data produced, taking into account the level of trust
in these data.

— IoT: in the context of the IoT, Zubair et al. provided a survey on data quality in IoT in which they
identified the characteristics of IoT data, both inherent and domain specific. Their classification
of IoT data quality is grouped into seven dimensions, namely inaccuracy, completeness, inconsistency,
ambiguity, uncertainty, timeliness, and credibility. Furthermore, besides those proposed in the literature
related to IoT, other dimensions could be introduced to assess IoT DQ, such as accessibility, access
security, and interpretability, along with two dimensions specific to the e-health and smart grid domains, namely
duplicates and availability [22]. Korachi and Bounabat used the DQSC-maturity model to
define the maturity level of a smart city based on the quality of the data generated
and consumed by the city, as well as to define the recommendations and solutions required to
reach the targeted level. Regarding remote sensing, acquisition, and decision-making processes,
many data quality dimensions that are directly relevant to sensor data have been described, together with
an approach to data quality evaluation for sensor data that supports domain knowledge aggregation and
dissemination. The goal quality model (GQM) is a goal-oriented method of defining software
measures with five dimensions: accuracy, correctness, integrity, unique, and validation. For assessing data
quality, this model begins by determining the quality objective of the data instances and then follows
the objective until it reaches the problem. Finally, these problems must be able to define the target [97].
To describe the quality assessment of geographic information systems, Puentes et al.
suggested a quality model that uses remote sensing products. Concerning dynamic data
quality, Labouseur and Matheus explained the principle of “one size does not fit all”
and its relationship with dynamic data, especially dynamic data quality, using data quality
dimensions, namely accessibility, ease of manipulation, and representation, to construct their model.


Table 6. Classification of DQ dimensions used 


Categories 


Dimensions 


105 


Data collection 
Data preprocessing 
Data storage 

Data analysis 
Conformance 
Completeness 
Plausibility 

Data properties 
Method and procedure 
Data value and information 
Intrinsic 
Accessibility 
Representational 
Contextual 
Contextual 
Intrinsic 

Intrinsic 
Accessibility 
Representational 
Contextual 
Contextual 
Intrinsic 

Intrinsic 


Contextual 
Extrinsic 


Availability 
Usability 
Reliability 
Relevance 
Presentation 
Syntactic level 
Semantic level 
Pragmatic level 


Accessibility, ease of manipulation, and representation 

Completeness, consistency, and accuracy 

Validity, understandability, reference, uncertainty, and balanced/unbiased 
Access security, accessibility, and interpretability 

Completeness, redundancy, accuracy, and data type 

Individual trustworthiness and global conclusiveness 

Believability, accuracy, precision, free-of-error, completeness, timeliness, and consis- 
tency 

Data completeness and data format 

Accessibility, accuracy, completeness, consistency, timeliness, and relevance 
Accuracy, completeness, consistency, conformity, integrity, and validity 
Accuracy, completeness, consistency, distinctness, precision, timeliness, and volume 
Accuracy, completeness, consistency, timeliness, validity, and uniqueness 
Accuracy, completeness, consistency, and timeliness 

Completeness, validity, accuracy, and currency 

Uniqueness, validity, accuracy, completeness, and timeliness 

Usefulness, complexity, and compliance 

Accuracy, completeness, consistency, and timeliness 

Comparability, completeness, correctness, and currency 

Accuracy, consistency, integrity, timeliness, and normative 

Plausibility, completeness, and conformance 

Inaccuracy, completeness, inconsistency, ambiguity, uncertainty, timeliness, and credi- 
bility 

Accuracy, correctness, integrity, unique, and validation 

Reliability and trust 

Validity, reference, understandability, balanced, and uncertainty 

Availability and relevance 

Usability and reliability 

Usability and availability 

Reliability and usability 

Value conformance, relational conformance, and computational conformance 


Uniqueness plausibility, atemporal plausibility, and temporal plausibility 
Completeness 

Accuracy 

Consistency 

Syntactic validity, semantic accuracy, consistency, conciseness and completeness 
Availability, licensing, interlinking, security, and performance 

Representational conciseness, interoperability, interpretability, and versatility 
Relevance, trustworthiness, understandability, and timeliness 

Relevancy 

Consistency and accuracy 

Accuracy and believability 

Accessibility 

Consistency and interpretability 

Timeliness and completeness 

Correctness and precision 

Relevancy, content diversity completeness, and timeliness 

Source precision, readability, accuracy, resolution, objectivity, integrity, reputation, 
consistency, obsolescence, uniqueness, freshness, and acquisition cost 

Real precision, timeliness, clarity, completeness, trust, concision, value added, volume, 
and believability 

Accessibility, manipulation, security, interpretability, ease of use, compatibility, format,
understandability, redundancy, and coherence

Accessibility, timeliness, and authorization 

Credibility, definition, metadata 

Accuracy, integrity, consistency, auditability, and completeness 

Fitness 

Readability, structure 

Completeness, integrity, consistency, validity, maintainability, and timeliness 
Accuracy, coverage 

Relevance, usability, risks, currency and decay 


— Web: Yi addressed open data quality by concentrating on data completeness and data format in order
to uncover concerns linked to open data. The author compared the open data formats used by the
governments of three countries: Korea, the United Kingdom, and the United States. He then offered
instances of incomplete data across the three nations to demonstrate the presence of the data quality
issue, together with advice on acceptable data formats and data completeness to enhance open data
quality. Zaveri and Rula, on the other hand, presented a quality assessment survey for linked data.
Their categorization of quality dimensions has four categories: intrinsic, accessibility, contextual, and
representational. Intrinsic dimensions define data quality from the standpoint of a data provider,
independently of external variables. Accessibility dimensions define the extent to which data is
available and retrievable. Contextual dimensions are those that depend heavily on the task at hand.
Representational dimensions capture information on the data's design. Luzzu's model [74] is built
on the same classification, with accessibility, intrinsic, contextual, and representational categories and
different dimensions for each category. Nevertheless, regarding the issues of Arabic DBpedia, focused
only on two categories (intrinsic and contextual) and compared current quality assessment tools,
showing the different attributes and functionalities of the available tools, in order to offer a new linked
DQA service that helps developers using the Arabic language to swiftly rectify problems in DBpedia
before using it in a linked data application.

— Information systems: for data warehouse systems, Singh and Kawaljeet presented a quality model of
six dimensions, as described in Table 6, associated with four data quality problems. They then classified
the causes of data quality issues at the various stages of data warehousing to help users address these
issues. Oliveira et al., in turn, introduced an expanded version of the mapping between data mining
challenges and data quality dimensions, with three procedures that aim to improve data quality and
detect data anomalies. Their model uses six dimensions: completeness, accuracy, accessibility,
consistency, timeliness, and relevance.

— Social media: recently, Salvatore et al. addressed the issue of social media data quality, using
Twitter as a reference platform. They described new quality factors: reliability, usability, availability,
relevance, and presentation quality. Likewise, Zengin and Onder proposed a method for evaluating the
quality of YouTube data on a specific issue, the side effects of biologic therapy. The model used to
assess the reliability and quality of the videos is presented in Table 6.

— General context: building on the idea of automating the verification of data quality, Schelter et al.
presented a declarative application programming interface (API) that lets users declare constraints on
their datasets, focusing on completeness, consistency, and accuracy. They explained how these
constraints translate into computations of metrics on the data, which are then used to evaluate the
constraints. Similarly, Jungbluth et al. chose five dimensions for their quality model: uniqueness,
validity, accuracy, completeness, and timeliness. Berghe and Gaeveren studied data quality in the
migration of data from an old system to a new one; they ranked cleaning tasks according to several
criteria, such as usefulness and complexity, while taking compliance into consideration. Furthermore,
Ceravolo and Bellini described a general methodology for DQA structured around the notion of
matching, which aims to provide a configurable model supporting task composition. Their three-level
classification is shown in Table 6. By way of illustration, three case studies are considered. The first,
by [118], analyzed the quality of higher education data at one of Indonesia's institutions, according to
the pangkalan data pendidikan tinggi (PDDikti, which is under the jurisdiction of the Ministry of
Research, Technology, and Higher Education), covering records such as the personal data of students
and teachers, achievements, and the outcomes of professors' assessments. According to the results of
this study, the quality dimensions of higher education data specified by the ministry are completeness,
validity, accuracy, and currency. The second, by [107], identified and examined the quality of data
generated by a security incident response team in an organization. After the data were collected, both
the analysis of these data and the conclusions of the interviews conducted with the team are presented
in terms of the accuracy, consistency, completeness, and timeliness of the data made available to the
team. Finally, a case report produced at the Manitoba Centre for Health Policy (MCHP) describes five
key dimensions of its data quality framework (DQF): accuracy, internal validity, external validity,
timeliness, and interpretability [119]. This case report is intended to guide other research institutes
that deal with administrative data and want to enhance their data quality evaluation process, and to
serve them as a best-practices resource.
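
The idea of declaring constraints that are then evaluated as metric computations can be illustrated with a minimal Python sketch. It is not the API of Schelter et al.; the names `Constraint`, `completeness`, `satisfies`, and `verify`, as well as the thresholds and sample rows, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]
Metric = Callable[[List[Row]], float]


@dataclass
class Constraint:
    """A declared expectation: `metric` must reach at least `threshold`."""
    name: str
    metric: Metric
    threshold: float


def completeness(column: str) -> Metric:
    """Metric: fraction of rows whose value in `column` is present."""
    def metric(rows: List[Row]) -> float:
        if not rows:
            return 1.0
        return sum(row.get(column) is not None for row in rows) / len(rows)
    return metric


def satisfies(column: str, predicate: Callable[[float], bool]) -> Metric:
    """Metric: fraction of present values in `column` that pass `predicate`."""
    def metric(rows: List[Row]) -> float:
        values = [row[column] for row in rows if row.get(column) is not None]
        if not values:
            return 1.0
        return sum(predicate(v) for v in values) / len(values)
    return metric


def verify(rows: List[Row],
           constraints: List[Constraint]) -> Dict[str, Tuple[float, bool]]:
    """Evaluate every constraint; return (metric value, passed?) per name."""
    report = {}
    for c in constraints:
        value = c.metric(rows)
        report[c.name] = (value, value >= c.threshold)
    return report


# Hypothetical sample data and constraints (assumptions for illustration).
rows = [{"age": 34.0}, {"age": None}, {"age": 210.0}, {"age": 58.0}]
checks = [
    Constraint("age is complete", completeness("age"), 0.7),
    Constraint("age is plausible", satisfies("age", lambda a: 0 <= a <= 120), 0.9),
]
report = verify(rows, checks)
for name, (value, passed) in report.items():
    print(f"{name}: {value:.2f} -> {'ok' if passed else 'violated'}")
```

Separating the declaration of a constraint from the computation of its metric means the same metric function can be reused under different thresholds, which is the design choice this sketch tries to show.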


The literature on data quality offers thorough definitions of the dimensions of data quality. However,
there is no agreement on which dimensions characterize data quality or on what exactly each dimension
means, because quality is contextual and depends on the perspective of each author [124]. Table 6
highlights the classifications of DQ dimensions employed in the different models of the studies described.
4.3.3. RQ3: which methodologies were utilized to assess data quality?

The literature includes a broad variety of methods for assessing and improving data quality. Because of 
the diversity and complexity of these methodologies, recent research has focused on developing methodologies 
that aid in the selection and application of DQA and improvement techniques. 


— Big data: large volumes of data are generated daily from heterogeneous sources, so their quality is
a major concern. The tension between these large volumes of data and the uncertainty about their
quality is a persistent issue. In this respect, Baldassarre et al. presented a methodology called data
quality to smart data (DQ2SD) that consists of four phases: data quality planning, analytics rules
definition, data quality control, and data quality enhancement. It aims to extract all of the value within
the data while ensuring that the collected data is both correct and appropriate. In other words, rather
than a huge amount of meaningless and untruthful data, the result is valuable data. In addition, Klas et
al. introduced a scalable quality assessment approach for big data (SQA4BD) with the goal of
decoupling data quality analysis from the final individual quality evaluation, which is based on the
particular quality demands of the future data consumer. Accordingly, they recommend starting with a
basic model for data quality, then analyzing the data, and ultimately assessing quality in an
inter-organizational setting. In this context, three major roles are defined: a data provider, who is able
to share specific data under conditions to be negotiated; a data consumer, who can use specific data
from the provider depending on whether these data help satisfy his or her information and quality
needs; and an authority, which determines the relevant aspects of quality and how they are calculated
for various types of data. Furthermore, addressing big data in relation to data quality and data
diagnosticity, Ghasemaghaei and Calic employed the DQF of [1] to identify the data quality categories
used in their study, and used organizational learning theory to characterize the influence of big data
utilization on data quality categories as well as on decision quality. Likewise, Cappiello et al. created
a DQS module that evaluates the quality of a data source using a set of dimensions including the 4 Vs
of big data (variety, volume, velocity, and veracity), in order to gain insights about the quality of the
examined big data sources.

— Healthcare: from a medical standpoint, more and more medical institutions are emphasizing the need
for an automated framework to maintain the quality of their data effectively. Accordingly, Pezoulas et
al. presented a web-based framework for medical DQA that focuses on improving clinical data
completeness, relevance, and accuracy by providing a set of quantitative features for metadata
extraction, data quality control, and data normalization. Similarly, for EHR data, the care pathway-data
quality framework (CP-DQF) suggested by [84] allows systematic management of data quality to
support more trustworthy process mining efforts in EHR research. Likewise, Kapsner et al. presented a
framework that supports a standardized, unified, and harmonized assessment of EHR data quality in
MIRACUM using common EHR data quality dimensions. In addition, this framework helps individual
hospitals to identify data quality issues systematically and then initiate reporting loops to improve data
quality. Zaccaria et al., for their part, proposed a five-step methodology for improving the data quality
of a large dataset extracted from a multicenter clinical trial, based on data preprocessing. Each step
aims to solve one of the problems observed during the first step, and the improvement in data quality is
evaluated after each step. Within the same background, a scalable framework is recommended by for
organizing data quality rules, sharing them, and ensuring their reuse across health care facilities as rule
templates, so that errors and discrepancies in data can be identified. In addition, Sun et al. offered a
provenance-based method for assessing data quality. A qualitative analysis is first used to examine the
current state of health data. Next, a model of instantiation provenance is introduced for health data
analysis. Based on this model, they provided a framework for analyzing health data and used a
prototype to verify the efficiency of the proposed method.

— IoT: in the IoT field, there are many methodologies and frameworks that aim at both assessing and
improving the quality of data. Aquino et al. proposed the Hygieia framework for measuring data
quality in an IoT context on constrained smart sensor network (SSN) devices, while imposing low
memory and communication overhead on those devices. In general, it is used to analyze the quality of
IoT data in order to deliver information to IoT applications. On the other hand, there is Valid.IoT, a
framework whose architecture is partitioned into three packages, each of which has a role in
determining the quality of data sources and generating quality of information vectors that annotate the
sensor information based on quality of data metrics [41]. Likewise, Luo et al. proposed a cross
validation strategy to solve various data quality issues and give a fresh viewpoint on improving IoT data
quality by fully using the power of crowds. Ge et al., for their part, offered a framework for data quality
management specific to the area of smart grids. The authors compiled a list of data quality issues and
used it to select the associated data quality dimensions, in order to determine which ones are critical for
improving smart grid data quality. They then identified seven essential data quality characteristics,
each of which is linked to specific smart grid data quality issues.

— Web: in order to measure the quality of linked data, Mihindukulasooriya et al. proposed a solution
based on the Loupe API, a RESTful service for profiling linked data that is configured by specifying
user requirements and other configuration details. Profiling findings may be used to evaluate and
validate the quality of a dataset, and so contribute to the quality improvement process by cleaning and
fixing deficiencies in the data. Additionally, Ahmed addressed data integration in the context of linked
open data (LOD), presenting the problem of DQA during the integration process and then a
methodology for evaluating the quality of data in LOD sources during the phases of that process.
Recently, To et al. introduced SydNet, a new framework for assessing the quality of linked data. This
framework provides an approach that helps network analysts define the quality dimensions and metrics
needed to assess the quality of network data sources accurately, and also helps to merge data from
various network data sources. With the aim of increasing the quality of data, a framework was
published by to enhance the quality of linked data: Luzzu, which comes with an extensive library of
applied quality measures, a declarative language for creating further domain-specific quality measures,
and a full collection of ontologies for gathering and distributing data quality information. Based on a
real case study of Colombian open government data, Sanabria et al. proposed a methodological process
that helps to assess the quality of Sabaneta City's open government data (OGD). This process consists
of three stages, beginning with the selection of the data set to be assessed and data profiling. The
findings and analyses of the data quality evaluation are defined on the basis of this data profiling; the
three dimensions of correctness, consistency, and completeness were reviewed, and faults were
detected.

— Information systems: Azeroual et al. presented the different data cleaning measures and techniques
used to enhance data quality in research information systems (RIS). To that end, knowing the causes
of poor data quality during data collection, data transmission, and data integration is critical in order to
analyze them and then remedy them through data transformation and data cleaning. The conceptual
framework of [65] is intended as an analytic tool that helps users understand and distinguish concepts
such as how quality issues can be presented and how potential data quality issues can be classified.
This study takes into consideration two dimensions of quality, global conclusiveness (GC) and
individual trustworthiness (IT). In addition, Timmerman and Bronselaer [55] suggested a rule-based
framework designed to identify and then address data quality issues caused by improper data collection
and validation methods as well as poor execution.

— Social media: Berlanga et al. proposed a methodology whose goal is to identify a reliable metric for
judging the overall quality of a group of posts and user profiles from different perspectives, and to
incorporate the metrics obtained from the various quality criteria used to filter posts by relevance.

— AI: Arbesser et al. described and discussed Visplause, a system for inspecting data quality problems
in numerous time series that uses time series meta-information and plausibility checks to structure and
summarize the results of data quality checks flexibly. He et al. investigated the link between data
quality and model quality, describing four aspects of data quality that may be encountered in the field
of deep learning. Their results reveal that all four aspects have a considerable influence on the quality
of deep neural network (DNN) models.

— General context: in addition to the methodologies mentioned above, there are other interesting ones
that are not focused on one specific area. For instance, the data quality validation methodology
(DQVM) aims to assess the effects of bad data quality, and to analyze the impact of the associated data
quality actions, on the results of processes, particularly scoring processes whose output is a ranking or
rating for an object. This methodology proposes a series of stages for examining the consequences of
faults, injecting faults methodically throughout the process to identify various abnormal circumstances
[104]. Furthermore, the total meteorological data quality (TMDQ) framework, based on the total
quality management (TQM) approach developed by [115], aims to offer observers views of
meteorological data quality from numerous perspectives according to four quality dimensions:
accuracy, consistency, completeness, and timeliness. On the basis of this framework, a validation
system was developed to help meteorological observers improve and maintain the quality of
meteorological data more efficiently. Similarly, Li et al. developed a task-oriented data quality
assessment (TODQA) framework that evaluates data quality from two aspects, intrinsic and contextual
quality. It defines, quantifies, and fuses assessment metrics to rank candidate data sets by their quality
for specific tasks. Simard et al. proposed a broad framework that focuses on the measurement and
categorization of data and tries to offer a measure of the amount of uncertainty based on four major
characteristics: accuracy, completeness, consistency, and timeliness. On the other hand, Kara et al.
presented an approach that proposes a global data quality model in order to obtain a general method of
manipulating and evaluating different factors of data quality. This model combines several data quality
models linked by five types of equivalence relations, which indicate the relationship between two
criteria of different models: D-Similar, S-Similar, D-Same, S-Same, or S-different. Each of these
models has a tree structure and comprises factors, criteria, and sub-criteria. Jungbluth et al. suggested
a quality data extraction methodology that assists users in extracting data so as to increase the quality of
the data acquired throughout the process. This approach is divided into four primary stages, each with
its own set of actions to be followed as a guide or reference. Finally, the methodology DMN4DQ used
in tries to simplify the description of data quality, which is separated into two levels: measurement and
assessment. The final evaluation is then calculated by adding the scores from each dimension.


4.3.4. RQ4: what are the quality metrics used in DQA?

Having discussed the different methodologies and frameworks proposed for assessing data quality
and gathered the data quality models suggested by researchers, we now present the different assessment
metrics used to measure these dimensions.


— Big data: as the goal of is to evaluate data value in terms of data quality, they employed three data
quality dimensions, accuracy, completeness, and redundancy, as shown in Table 7. A linear model
based on these dimensions is then established to calculate the quality scores. In addition, Liu et al.
proposed an approximate quality assessment model based on data set sampling to evaluate the quality of
big data. Using various sample sizes and sampling techniques, the authors chose three dimensions,
completeness, accuracy, and timeliness, to evaluate each sample. Table 8 gives the three metrics,
where S is a collection of data units, S_acc is the subset of accurate data units in S, S_cp is the subset of
complete data units in S, N is the cardinality of S, and M is the combined cardinality of S_acc and S_cp.


Table 7. Quality metrics used in 


Accuracy | Completeness | Redundancy
1 - (Nb of data with errors / Nb total of data) | 1 - (Nb of incomplete data / Nb total of data) | 1 - (Nb of duplicate data / Nb total of data)


Table 8. Quality metrics used in 


Accuracy Completeness Timeliness 
Degace = ini = Dego = DM = Degtim = ý De Degeim (ai) 
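
The ratio metrics of Table 7, and the linear combination of dimension scores mentioned above, can be sketched in Python as follows. This is a minimal illustration, not the cited authors' model: the weights are not given in the text, so equal weights are an assumption, as are the error rule, the duplicate definition (identical records), and all function names.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[float]]


def ratio_score(flagged: int, total: int) -> float:
    """Table 7 pattern: 1 - (number of flagged data / total number of data)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - flagged / total


def dimension_scores(records: List[Record],
                     has_error: Callable[[Record], bool]) -> Dict[str, float]:
    """Compute accuracy, completeness, and redundancy scores for `records`."""
    total = len(records)
    errors = sum(has_error(r) for r in records)
    incomplete = sum(any(v is None for v in r.values()) for r in records)
    # Assumption: a duplicate is a record identical to one already seen.
    distinct = {tuple(sorted(r.items())) for r in records}
    duplicates = total - len(distinct)
    return {
        "accuracy": ratio_score(errors, total),
        "completeness": ratio_score(incomplete, total),
        "redundancy": ratio_score(duplicates, total),
    }


def linear_quality(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of dimension scores (weights are assumed, not sourced)."""
    return sum(weights[name] * scores[name] for name in scores)


# Hypothetical sensor-like records: one missing value, one out-of-range value,
# and one exact duplicate of the first record.
records = [
    {"id": 1.0, "temp": 21.5},
    {"id": 2.0, "temp": None},
    {"id": 3.0, "temp": 999.0},
    {"id": 1.0, "temp": 21.5},
    {"id": 5.0, "temp": 19.0},
]
out_of_range = lambda r: r["temp"] is not None and not (-50.0 <= r["temp"] <= 60.0)
scores = dimension_scores(records, out_of_range)
overall = linear_quality(scores, {name: 1 / 3 for name in scores})
print(scores, round(overall, 3))
```

Each dimension has one flagged record out of five in this sample, so every score is 0.8 and so is the equally weighted total.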


Taleb et al. suggested a big data quality evaluation system that attempts to improve data quality by
estimating and assessing data before analysis begins. To this end, they employed a model containing
the quality dimensions most frequently used in the context of big data, namely accuracy, completeness,
and consistency, with the metrics shown in Table 9. Likewise, Mylavarapu et al. developed a data
accuracy assessment model that evaluates the accuracy of both intrinsic and contextual data without
requiring domain knowledge. Using machine learning techniques, they selected the best data from a
collection of data sets and treated it as the correct one. The formula used in this study is as follows,
where ACC_In represents the intrinsic data accuracy of the new dataset (N), M and K indicate the
number of records and variables in the dataset, respectively, and l_ij and d_ij denote the data item of
the i-th record and the j-th variable in the new and correct datasets.
\[
ACC_{In} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{K} \left( 1 - \frac{|l_{ij} - d_{ij}|}{\max(l_{ij}, d_{ij})} \right)}{M \cdot K}
\]


Table 9. Quality metrics used in 


Accuracy | Completeness | Consistency
Nb of correct values / Total Nb of values of sample data | Nb of missing values / Total Nb of values of sample data | Nb of values that respect the constraints / Total Nb of values of sample data


— IoT: the metrics utilized by in the Valid.IoT model are formalized in Table 10. One further metric
is defined piecewise: it takes the value 1 if the sensor data is derived from a single IoT hardware
sensor and is not aggregated or interpolated, and the value 0 for an unnamed data source that aggregates
data using unidentified algorithms. In Table 10, C_o(x_0, x_i) is the individual concordance, and
d(a_i, b_j) is a distance function, based on infrastructure and propagation, between sensor locations a
and b for sensors i and j.


Table 10. Quality metrics used in 


Completeness | Timeliness | Concordance
1 - (Nb of missing data / Nb of expected data) | Current-timestamp - Message-timestamp | \sum_{i=1}^{n} [C_o(x_0, x_i) / d(x_0, x_i)] / \sum_{i=1}^{n} [1 / d(x_0, x_i)]
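
The three metrics of Table 10 can be sketched in Python as follows. This is an illustrative reading, not the Valid.IoT implementation: it assumes that concordance is an inverse-distance-weighted mean of the individual concordance values, that those values and the distances are supplied by the caller, and that timestamps are expressed in seconds. All function names are hypothetical.

```python
from typing import Iterable, Tuple


def completeness(missing: int, expected: int) -> float:
    """1 - (number of missing data / number of expected data)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 1.0 - missing / expected


def timeliness(current_timestamp: float, message_timestamp: float) -> float:
    """Age of a message: current timestamp minus message timestamp."""
    return current_timestamp - message_timestamp


def concordance(neighbours: Iterable[Tuple[float, float]]) -> float:
    """Inverse-distance-weighted mean of individual concordances.

    Each item is (C_o(x_0, x_i), d(x_0, x_i)): the individual concordance
    with neighbouring sensor i and the distance to that sensor. Closer
    sensors therefore contribute more to the result (assumed weighting).
    """
    pairs = list(neighbours)
    if not pairs:
        raise ValueError("at least one neighbouring sensor is required")
    if any(distance <= 0 for _, distance in pairs):
        raise ValueError("distances must be positive")
    weighted_sum = sum(c / distance for c, distance in pairs)
    weight_total = sum(1.0 / distance for _, distance in pairs)
    return weighted_sum / weight_total


# Hypothetical readings: a near sensor agrees, a far sensor disagrees.
c = concordance([(1.0, 10.0), (0.0, 40.0)])
print(completeness(5, 100), timeliness(1700000060.0, 1700000000.0), c)
```

With these sample values the near sensor carries four times the weight of the far one, so the concordance is 0.8 rather than the unweighted mean of 0.5.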


— Web: according to the study conducted on three Chinese local government datasets from Beijing,
Guangzhou, and Harbin, there are sixteen types of quality problems (P_i, 1 ≤ i ≤ 16) that can affect
data availability, such as misalignment, data that are too coarse or too granular, jumbled text, missing
values, and varying date formats. For this reason, Li et al. classified seven quality characteristics and
metrics at various levels and then associated each dimension with the appropriate quality problems in
order to score the three datasets. The results of this evaluation show that the total scores for
completeness, correctness, and consistency are poor, which may lead consumers to incorrect
conclusions. Table 11 presents the classification of quality evaluation metrics.


\[
\text{Completeness} = 1 - \frac{\text{Nb of incomplete data}}{\text{Nb total of data}}
\]


Table 11. Quality metrics used in 


Dimension | Score 1 | Score 0
Accuracy | no quality problems | at least one quality problem (P3-P7)
Consistency | no quality problems | at least one quality problem (P8-P10)
Openness | no quality problems | at least one quality problem (P15-P16)
Timeliness | no quality problems | at least one quality problem (P12)
Uniqueness | no quality problems | at least one quality problem (P13)
Understandability | no quality problems | at least one quality problem (P1, P2, P14)
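
The binary scoring of Table 11 can be sketched in Python: a dimension scores 1 only when none of its associated problem types is observed in a dataset. The mapping below copies the problem ranges from the table; the function names and the sample set of observed problems are hypothetical.

```python
from typing import Dict, Set

# Problem types (P1..P16) associated with each dimension, as in Table 11.
DIMENSION_PROBLEMS: Dict[str, Set[int]] = {
    "accuracy": {3, 4, 5, 6, 7},
    "consistency": {8, 9, 10},
    "openness": {15, 16},
    "timeliness": {12},
    "uniqueness": {13},
    "understandability": {1, 2, 14},
}


def binary_scores(observed_problems: Set[int]) -> Dict[str, int]:
    """Score 1 if no associated problem was observed, otherwise 0."""
    return {
        dimension: 0 if problems & observed_problems else 1
        for dimension, problems in DIMENSION_PROBLEMS.items()
    }


def completeness(incomplete: int, total: int) -> float:
    """Ratio metric used for completeness: 1 - incomplete / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - incomplete / total


# Hypothetical dataset in which problem types P4 and P13 were found.
scores = binary_scores({4, 13})
print(scores, completeness(25, 100))
```

A binary score makes datasets easy to compare but hides how often a problem occurs, which is why completeness is measured with a ratio instead.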


— Industry: the primary contribution of is the development of a general model for objectively assessing
the quality of industrial signal data. This model can provide a useful basis for decisions in data mining
procedures about the effective and efficient use of available data. Additionally, Guo et al. gave a
theoretical study of data quality in which they defined and supplied various techniques and technologies
to enhance data quality. Finally, a data quality evaluation model is given, and a model calculation
approach is used to compute the outcome of each dataset in this model.


\[
D = \frac{\sum_{i=1}^{|RT|} W_i \times L_i}{\sum_{i=1}^{|RT|} W_i}
\]

\[
R = \frac{\sum_{i=1}^{|RT|} W_i \times E_i}{\sum_{i=1}^{|RT|} W_i}
\]

With D representing the real data quality of the data set T, R representing the difference between D and
the expected value, RT representing the rule set of data set T, W_i representing the weight of rule R_i in
RT, E_i representing the expected value, and L_i representing the result of computation.
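
The weighted rule aggregation described above can be sketched in Python. This is an assumption-laden illustration rather than the authors' calculation: it treats D as the weight-normalized sum of the rule results L_i, aggregates the expected values E_i with the same weights, and reports the gap between the two, since the text describes R as the difference between D and the expected value. The function name and sample numbers are hypothetical.

```python
from typing import Sequence


def weighted_mean(weights: Sequence[float], values: Sequence[float]) -> float:
    """Return sum(W_i * v_i) / sum(W_i) over the rules of a rule set."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * v for w, v in zip(weights, values)) / total_weight


# Hypothetical rule set with three rules.
weights = [2.0, 1.0, 1.0]    # W_i: weight of rule R_i
results = [1.0, 0.5, 0.0]    # L_i: computed result of rule R_i
expected = [1.0, 1.0, 1.0]   # E_i: expected value of rule R_i

d = weighted_mean(weights, results)            # real data quality
expected_quality = weighted_mean(weights, expected)
gap = d - expected_quality                     # assumed reading of R
print(d, expected_quality, gap)
```

Normalizing by the sum of the weights keeps D on the same scale as the individual rule results, so rule sets of different sizes remain comparable.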

— General context: Simard et al. proposed a broad framework that focuses on the measurement and
categorization of data and tries to offer a measure of the amount of uncertainty, concentrating on the
four primary characteristics of accuracy, completeness, consistency, and timeliness. Table 12 includes
the defined quality evaluation measures. Aljumaili et al. [102] proposed a model for the assessment of
data quality that takes into consideration the models of [125] and is based on metadata and content
analysis. In addition, a software tool was developed to validate the proposed measures and to provide
an overview of metadata quality. Table 13 presents the quality metrics chosen by [102].


Table 12. Quality metrics used in [114] 


Accuracy Completeness Timeliness Consistency 
RV=MV 
Sea Nb of incomplete value Nb of inconsistent value 
=- g = PE p 
1 | 2 | 1 Total Nb of value 1 |Accuracy 1 Accuracy2| 1 Total Nb of Value 
Table 13. Quality metrics used in [102] 

Accuracy Completeness Redundancy Data type 

1 Nb of data in error 1 Nb of incomplete data 1 Nb of redundant data 1 Nb of data violating datatype


Almost the same dimensions are used for meteorological data; however, the specific metrics and
criteria used to evaluate quality may vary according to the unique characteristics and requirements of
meteorological data. Tsai and Chan proposed the metrics shown in Table 14. Finally, Liu et al.
provided a summary of the current state of research on data quality; they analyzed the most pertinent
quality characteristics and defined evaluation criteria based on user-defined requirements. The
established DQA model includes the major dimensions of data quality, quality characteristics, and
quality indicators. A dynamic data quality evaluation process is then built on the basis of the model
shown in Table 15.

Table 14. Quality metrics used in [115] 
Accuracy Completeness Timeliness 
1 Nb of failed range checks 1 Nb of missing data 1 Nb of data not corrected 


Table 15. Assessment metrics in [110]


Dimensions | Indicators | Metrics
Completeness | Attribute completeness | All data that match the criteria / all data that participated in the evaluation of this indicator
 | Record completeness | Nb of records meeting all the criteria / Nb of records participating in the evaluation of this indicator
Accuracy | Range accuracy | All data that match the criteria / all data that participated in the evaluation of this indicator
Consistency | Reference consistency | Nb of data of all eligible rows / Nb of all rows participating in the evaluation
 | Format consistency | Columns eligible / all columns involved
Confidentiality | Source confidentiality | Average points based on survey results
Recoverability | Periodic backup | Average points based on survey results
Traceability | Access traceability | Average points based on survey results
 | Value traceability | Average points based on survey results
Understandability | Data understandability | Average points based on survey results
 | Data model understandability | Average points based on survey results
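
The two completeness indicators of Table 15 differ only in the unit that is counted, which a short Python sketch can make concrete. It assumes that the criterion for a value is simply that it is present (not `None`); the function names and sample rows are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def attribute_completeness(rows: List[Row], fields: Sequence[str]) -> float:
    """Values that meet the criterion / all values evaluated."""
    evaluated = len(rows) * len(fields)
    if evaluated == 0:
        raise ValueError("nothing to evaluate")
    present = sum(row.get(f) is not None for row in rows for f in fields)
    return present / evaluated


def record_completeness(rows: List[Row], fields: Sequence[str]) -> float:
    """Records meeting the criterion on every field / all records evaluated."""
    if not rows:
        raise ValueError("nothing to evaluate")
    complete = sum(all(row.get(f) is not None for f in fields) for row in rows)
    return complete / len(rows)


# Hypothetical records: one of eight values is missing.
rows = [
    {"name": "Ana", "email": "ana@example.org"},
    {"name": "Ben", "email": None},
    {"name": "Cy", "email": "cy@example.org"},
    {"name": "Di", "email": "di@example.org"},
]
fields = ["name", "email"]
print(attribute_completeness(rows, fields), record_completeness(rows, fields))
```

A single missing value lowers attribute completeness only slightly (7 of 8 values) but removes a whole record from record completeness (3 of 4 records), so the two indicators answer different questions.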
5. DISCUSSION

In this section, we provide a comprehensive overview and discussion of the findings obtained in
response to research questions RQ1, RQ2, RQ3, and RQ4. By mapping these findings, we aim to
illustrate the current landscape of data quality research and to highlight key insights and trends.


5.1. Discussion of findings 

Context is essential for determining which data is relevant to the user: the data must be tailored to the
user's environment, and the environment determines the context. RQ1 therefore aims to discover the
application domains of data quality. By examining the distribution across domains, researchers and
practitioners can identify areas that have received more attention and those that may require further
exploration and investigation. For this research question, we observed that certain research domains
related to data quality have received more attention than others. Specifically, the domains of big data,
IoT, and information systems have been extensively studied, with a significant number of papers
dedicated to exploring the quality aspects in these areas. The prominence of these domains highlights
their importance and the need for effective data quality management in these contexts. Additionally,
the web has emerged as a significant domain of interest, reflecting its growing role as a major source of
information. Healthcare data has also garnered attention, emphasizing the crucial role of data quality in
improving healthcare systems. Social media, AI, and industry have also been subjects of research,
albeit to a lesser extent. The overall distribution of publications across these domains is depicted in
Figure 4, which provides a visual representation of the research focus in different application areas.

One of our objectives in this systematic review was to identify quality models in order to determine the
data quality dimensions commonly employed by researchers. Each of these dimensions may be
associated with quality metrics for evaluating data quality. The majority of research between 2016 and
2021 concentrated on the data quality model: 39 publications investigated the link between data quality
dimensions, detailed current data quality management difficulties, and assessed existing data quality
methodologies. In total, RQ2 identified over 54 quality dimensions. The dimensions most often used
to measure data quality in the studies reviewed are completeness, correctness, consistency, and
timeliness. Although their definitions differ somewhat because of the contextual nature of quality, they
are universal for any data quality evaluation, regardless of its significance. To give readers a better
understanding, Figure 5 plots the percentage of use of the dimensions employed in the studies; it
confirms that completeness, correctness, consistency, and timeliness are the most common.

Research on data quality has covered a wide range of topics in various domains, as shown in RQ1.
In addition, a number of methodologies and frameworks have been proposed by the authors to improve
evaluation strategies and to overcome data quality issues. For this reason, RQ3 reviewed the proposed
methodologies and frameworks to provide a systematic and comparative description of existing data
quality methodologies. Analyzing them revealed a rising need for new approaches to assessing data
quality that are more effective and scalable.


[Figure 4: bar chart. x-axis: Domains (Big data, IoT, Web, Healthcare, Social media, Industry, Information systems, AI, General); y-axis: Number of studies]


Figure 4. The overall distribution of publications across research domains 
Figure 5. Representation of our findings by percentage of use of quality dimensions 


Data quality metrics are used to analyze data quality by measuring accuracy, completeness, consistency, 
timeliness, and other aspects of data quality. There are two types of metrics: qualitative and 
quantitative. Quantitative metrics are those that can be assigned a numerical value. Qualitative 
metrics cannot be quantified and are based on user impression. Finally, RQ4 revealed the evaluation 
metrics most used. We can conclude from the studies of existing metrics that there is no single formula to 
measure quality; it depends on the context and the intended use of the data. Indeed, of the 100 final studies 
reviewed, only 12 publications dealt with data quality metrics, and together they offer 40 metrics 
for all 54 dimensions. 
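
To make the notion of a quantitative metric concrete, the following is a minimal sketch of two ratio-based metrics, one for completeness and one for timeliness. It is an illustration only and is not taken from any of the reviewed studies: the function names, the record layout, the 30-day freshness threshold, and the convention of returning 1.0 for empty input are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def completeness(records, fields):
    """Share of expected field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0  # assumed convention: nothing expected, nothing missing
    filled = sum(
        1 for r in records for f in fields if r.get(f) not in (None, "")
    )
    return filled / total


def timeliness(records, ts_field, now, max_age):
    """Share of records whose timestamp is no older than max_age."""
    if not records:
        return 1.0  # assumed convention for empty input
    fresh = sum(
        1
        for r in records
        if r.get(ts_field) is not None and now - r[ts_field] <= max_age
    )
    return fresh / len(records)


# Hypothetical sample data, invented for illustration.
records = [
    {"id": 1, "name": "a", "updated": datetime(2021, 1, 10)},
    {"id": 2, "name": "", "updated": datetime(2020, 1, 1)},
    {"id": 3, "name": "c", "updated": None},
    {"id": 4, "name": "d", "updated": datetime(2021, 1, 1)},
]
now = datetime(2021, 1, 15)

c = completeness(records, ["id", "name", "updated"])
t = timeliness(records, "updated", now, timedelta(days=30))
print(f"completeness={c:.2f} timeliness={t:.2f}")
```

Both functions return a value in [0, 1], which makes scores comparable across dimensions. In line with the conclusion above, the choice of threshold and of what counts as a missing value depends on the context, so these definitions should be read as one possible instance and not as a general formula.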


6. CONCLUSION 


Existing research that focuses on domain-specific requirements in relation to DQA is considerable, 
and work done in relation to data dimensionality is also widespread. A thorough literature review was used 
in this research to clarify the landscape of data quality evaluation. Our evaluation drew on 100 
research publications on data quality published between 2016 and 2021. This study's findings reveal a substantial 
trend in data quality research publishing: each year, the number of papers on data quality grows dramatically, 
which demonstrates the significance of data quality research across a variety of study disciplines, including online 
users, databases, web information, sensors, and big data. As a result, this study will not only help academics 
and practitioners but will also give support and insight for future research on data quality evaluation. 
We consider that our objective was met and that each of the research questions was answered. As future 
work, we intend to build on this SLR to contribute to the implementation of a new data quality model 
that includes all relevant quality dimensions, as well as the metrics needed to perform an effective DQA. 